Introduction


Business Understanding

Problem statement

To develop models for an insurance company using the Heart Disease dataset from the UCI Machine Learning Repository. The goal is to predict the likelihood of a person developing heart disease, which would help the insurance company estimate health risks and adjust premiums accordingly.


Data Understanding

The dataset contains various features related to patients’ health and demographic information. We will explore the dataset to understand its structure and relationships between variables.

Data dictionary

The dataset contains 14 key attributes that are either numerical or categorical.

Attribute Type Description Constraints/ Rules
age Numerical The age of the patient in years Range: 29-77 (based on dataset statistics)
sex Categorical The gender of the patient Values: 1 = Male, 0 = Female
cp Categorical Type of chest pain experienced by the patient Values: 1 = Typical angina, 2 = Atypical angina, 3 = Non-anginal pain, 4 = Asymptomatic
trestbps Numerical Resting blood pressure of the patient, measured in mmHg Range: Typically, between 94 and 200 mmHg
chol Numerical Serum cholesterol level in mg/dl Range: Typically, between 126 and 564 mg/dl
fbs Categorical Fasting blood sugar level > 120 mg/dl Values: 1 = True, 0 = False
restecg Categorical Results of the patient’s resting electrocardiogram Values: 0 = Normal, 1 = ST-T wave abnormality, 2 = Probable or definite left ventricular hypertrophy
thalach Numerical Maximum heart rate achieved during a stress test Range: Typically, between 71 and 202 bpm
exang Categorical Whether the patient experiences exercise-induced angina Values: 1 = Yes, 0 = No
oldpeak Numerical ST depression induced by exercise relative to rest (an ECG measure) Range: 0.0 to 6.2 (higher values indicate more severe abnormalities)
slope Categorical Slope of the peak exercise ST segment Values: 1 = Upsloping, 2 = Flat, 3 = Downsloping
ca Numerical Number of major vessels colored by fluoroscopy Range: 0-3
thal Categorical Blood disorder variable related to thalassemia Values: 3 = Normal, 6 = Fixed defect, 7 = Reversible defect
target Categorical Diagnosis of heart disease Values: 0 = No heart disease, 1 = Presence of heart disease


Data Preparation

Data loading

Load the dataset from the UCI website to memory

# Load the dataset
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# Read the dataset into a dataframe
Heart.df <- read.csv(text = getURL(url), header = FALSE, na.strings = "?")

Rename the columns into a meaningful column names

colnames(Heart.df) <- c("age", "sex", "cp", "trestbps", "chol", "fbs",
                        "restecg", "thalach", "exang", "oldpeak", 
                        "slope", "ca", "thal", "target")

Display dimensions of the dataset

dim(Heart.df)
## [1] 303  14

Display the first six rows of the dataset

head(Heart.df)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  1      145  233   1       2     150     0     2.3     3  0    6
## 2  67   1  4      160  286   0       2     108     1     1.5     2  3    3
## 3  67   1  4      120  229   0       2     129     1     2.6     2  2    7
## 4  37   1  3      130  250   0       0     187     0     3.5     3  0    3
## 5  41   0  2      130  204   0       2     172     0     1.4     1  0    3
## 6  56   1  2      120  236   0       0     178     0     0.8     1  0    3
##   target
## 1      0
## 2      2
## 3      1
## 4      0
## 5      0
## 6      0

Display the structure of the dataframe

str(Heart.df)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : num  1 1 1 1 0 1 0 0 1 1 ...
##  $ cp      : num  1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ restecg : num  2 2 2 0 2 0 2 0 2 2 ...
##  $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : num  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : num  3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : num  0 3 2 0 0 0 2 0 1 0 ...
##  $ thal    : num  6 3 7 3 3 3 3 3 7 7 ...
##  $ target  : int  0 2 1 0 0 0 3 0 2 1 ...

Display the statistical summary of the dataframe

summary(Heart.df)
##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :3.000   Median :130.0  
##  Mean   :54.44   Mean   :0.6799   Mean   :3.158   Mean   :131.7  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0  
##                                                                  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :241.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.7   Mean   :0.1485   Mean   :0.9901   Mean   :149.6  
##  3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##                                                                   
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :2.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.601   Mean   :0.6722  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :3.000   Max.   :3.0000  
##                                                  NA's   :4       
##       thal           target      
##  Min.   :3.000   Min.   :0.0000  
##  1st Qu.:3.000   1st Qu.:0.0000  
##  Median :3.000   Median :0.0000  
##  Mean   :4.734   Mean   :0.9373  
##  3rd Qu.:7.000   3rd Qu.:2.0000  
##  Max.   :7.000   Max.   :4.0000  
##  NA's   :2

Data preprocessing

We will preprocess the data by handling missing values, encoding categorical variables, and scaling numerical features.


Modeling


Evaluation


Deployment


Conclusion